Segmentation Methods for Recognition of Machine-Printed Characters
Authors
Abstract
This paper reports an investigation of some methods for isolating, or segmenting, characters during the reading of machine-printed text by optical character recognition systems. Two new segmentation algorithms using feature extraction techniques are presented; both are intended for use in the recognition of machine-printed lines of 10-, 11- and 12-pitch serif-type multifont characters. One of the methods, called quasi-topological segmentation, bases the decision to “section” a character on a combination of feature-extraction and character-width measurements. The other method, topological segmentation, involves feature extraction alone. The algorithms have been tested with an evaluation method that is independent of any particular recognition system. Test results are based on application of the algorithm to upper-case alphanumeric characters gathered from print sources that represent the existing world of machine printing. The topological approach demonstrated better performance on the test data than did the quasi-topological approach.

Introduction

When character recognition systems are structured to recognize one character at a time, some means must be provided to divide the incoming data stream into segments that define the beginning and end of each character. Writing about this aspect of pattern recognition in his review article, G. Nagy [1] stated that “object isolation is all too often ignored in laboratory studies. Yet touching characters are responsible for the majority of errors in the automatic reading of both machine-printed and hand-printed text. . . . ” The importance of the touching-character problem in the design of practical character recognition machines motivated the laboratory study reported in this paper.
We present two new algorithms for separating upper-case serif characters, develop a general philosophy for evaluating the effectiveness of segmentation algorithms, and evaluate the performance of our algorithms when they are applied to 10-, 11- and 12-pitch alphanumeric characters.

The segmentation algorithms were developed specifically for potential use with recognition systems that use a raster-type scanner to produce an analog video signal that is digitized before presentation of the data to the recognition logic. The raster is assumed to move from right to left across a line of printed characters and to make approximately 20 vertical scans per character. This approach to recognition technology is the one most commonly used in IBM’s current optical character recognition machines. A paper on the IBM 1975 Optical Page Reader [2] gives one example of how the approach has been implemented. Other approaches to recognition technology may not require that decisions be made to identify the beginning and end of characters. Nevertheless, the performance of any recognition system is affected by the presence of touching characters, and the design of recognition algorithms must take the problem into account (see Clayden, Clowes and Parks [3]).

Simple character recognition systems of the type we are concerned with perform segmentation by requiring that bit patterns of characters be separated by scans containing no “black” bits. However, this method is rarely adequate to separate characters printed in the common business-machine and typewriter fonts. These fonts, after all, were not designed with machine recognition in mind; but they are nevertheless the fonts it is most desirable for a machine to be able to recognize. In the 12-pitch, serif-type fonts examined for the present study, up to 35 percent of the segments occurred not at blank scans, but within touching character pairs.

153 SEGMENTATION ALGORITHMS MARCH 1971
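The blank-scan rule described in the introduction can be illustrated with a minimal sketch. This is not either of the paper's two algorithms; it only shows the simple baseline and why it fails on touching characters. The function name `segment_at_blank_scans`, the representation of a scan as a list of 0/1 bits, and the toy data are all assumptions made for illustration.

```python
from typing import List, Tuple

# Assumption: one vertical scan is a list of bits, 1 = "black", 0 = white.
Scan = List[int]


def segment_at_blank_scans(scans: List[Scan]) -> List[Tuple[int, int]]:
    """Return (first, last) scan indices for each run of non-blank scans.

    A segment boundary is placed only where a scan contains no black bits,
    which is the simple rule the introduction describes.
    """
    segments: List[Tuple[int, int]] = []
    start = None  # index of the first scan of the current run, if any
    for i, scan in enumerate(scans):
        has_black = any(scan)
        if has_black and start is None:
            start = i  # a new character candidate begins here
        elif not has_black and start is not None:
            segments.append((start, i - 1))  # a blank scan closes the run
            start = None
    if start is not None:
        # The data ended while a run was still open.
        segments.append((start, len(scans) - 1))
    return segments


# Hypothetical toy data: three-bit scans, listed in scan order.
blank = [0, 0, 0]
ink = [1, 0, 1]

separated = [blank, ink, ink, blank, ink, ink, blank]
touching = [blank, ink, ink, ink, ink, ink, blank]

print(segment_at_blank_scans(separated))  # [(1, 2), (4, 5)]
print(segment_at_blank_scans(touching))   # [(1, 5)]
```

In the second case the two characters share no blank scan, so the rule returns a single segment covering both. That is the touching-character failure that the feature-based algorithms in the paper are meant to address.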
Similar resources
Character Segmentation Scheme for OCR System: For Myanmar Printed Documents
Automatic machine-printed Optical Characters or texts Recognizers (OCR) are highly desirable for a multitude of modern IT applications, including Digital Library software. However, the state of the art OCR systems cannot do for Myanmar scripts as the language poses many challenges for document understanding. Therefore, the authors design an Optical Character Recognition System for Myanmar Print...
OCR for printed Kannada text to Machine editable format using Database approach
This paper describes an Optical Character Recognition (OCR) system for printed text documents in Kannada, a South Indian language. The proposed OCR system for the recognition of printed Kannada text, which can handle all types of Kannada characters. The system first extracts image of Kannada scripts, then from the image to line segmentation then segments the words into sub-character level piece...
Segmentation Problems and Solutions in Printed Degraded Gurmukhi Script
Character segmentation is an important preprocessing step for text recognition. In degraded documents, existence of touching characters decreases recognition rate drastically, for any optical character recognition (OCR) system. In this paper we have proposed a complete solution for segmenting touching characters in all the three zones of printed Gurmukhi script. A study of touching Gurmukhi cha...
An Efficient OCR for Printed Malayalam Text using Novel Segmentation Algorithm and SVM Classifiers
This paper describes an Optical Character Recognition (OCR) System for printed text documents in Malayalam, a South Indian language. Indian scripts are rich in patterns while the combinations of such patterns makes the problem even more complex and these complex patterns are exploited to arrive at the solution. The system segments the scanned document image into text lines, words and further ch...
Journal:
- IBM Journal of Research and Development
Volume 15, Issue
Pages -
Publication date: 1971